Abstract

In this work we show that, using the eigen-decomposition of the adjacency 
matrix, we can consistently estimate latent positions for random dot product 
graphs provided the latent positions are i.i.d. from some distribution. If class 
labels are observed for a number of vertices tending to infinity, then we show 
that the remaining vertices can be classified with error converging to Bayes optimal
using the k-nearest-neighbors classification rule. We evaluate the proposed
methods on simulated data and a graph derived from Wikipedia. 



1 Introduction 

The classical statistical pattern recognition setting involves 

\[ (X, Y), (X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} F_{X,Y}, \]

where the $X_i : \Omega \mapsto \mathbb{R}^d$ are observed feature vectors and the
$Y_i : \Omega \mapsto \{0, 1\}$ are observed class labels for some probability space $\Omega$.
We define $\mathcal{D} = \{(X_i, Y_i)\}$ as the training set. The goal is to learn a classifier
$h(\cdot\,; \mathcal{D}) : \mathbb{R}^d \to \{0, 1\}$ such that the probability of error
$\mathbb{P}[h(X; \mathcal{D}) \neq Y \mid \mathcal{D}]$ approaches Bayes optimal as $n \to \infty$
for all distributions $F_{X,Y}$, a property known as universal consistency (Devroye et al. 1996).
Here we consider the case wherein the feature vectors $X, X_1, \ldots, X_n$ are unobserved,
and we observe instead a latent position graph $G(X, X_1, \ldots, X_n)$ on $n + 1$ vertices.
We show that a universally consistent classification rule (specifically, k-nearest neighbors)
remains universally consistent for this extension of the pattern recognition setup to
latent position graph models.



Latent space models for random graphs (Hoff et al. 2002) offer a framework in
which a graph structure can be parametrized by latent vectors associated with each
vertex. Then, the complexities of the graph structure can be characterized using
well-known techniques for vector spaces. One approach, which we adopt here, is
that given a latent space model for a graph, we first estimate the latent positions and
then use the estimated latent positions to perform subsequent analysis. When the
latent vectors determine the distribution of the random graph, accurate estimates
of the latent positions will often lead to accurate subsequent inference.

In particular, this paper considers the random dot product graph model introduced
in Nickel (2006) and Young and Scheinerman (2007). This model supposes
that each vertex is associated with a latent vector in $\mathbb{R}^d$. The probability that two
vertices are adjacent is then given by the dot product of their respective latent vectors.
We investigate the use of an eigen-decomposition of the observed adjacency
matrix to estimate the latent vectors. The motivation for this estimator is that, had
we observed the expected adjacency matrix (the matrix of adjacency probabilities),
then this eigen-decomposition would return the original latent vectors (up to an
orthogonal transformation).
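The model and the motivation for the estimator can be illustrated numerically. The following is a minimal sketch, not the authors' code: the helper names `sample_rdpg` and `top_d_embedding`, the seed, and the choice of latent distribution are assumptions made for illustration. It samples a graph whose edge probabilities are dot products of latent vectors and checks that the scaled eigenvectors of the expected adjacency matrix reproduce the same Gram matrix as the latent vectors, which is what "up to an orthogonal transformation" means here.

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with P[A_ij = 1] = <X_i, X_j>."""
    P = X @ X.T
    n = P.shape[0]
    # Draw independent Bernoulli edges for i < j only, then symmetrize.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(float)
    return A, P

def top_d_embedding(M, d):
    """Scaled eigenvectors for the d largest eigenvalues of a symmetric matrix M."""
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(vals)[::-1][:d]
    # Clip guards against tiny negative eigenvalues caused by round-off.
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

rng = np.random.default_rng(0)
# Illustrative latent positions: first two coordinates of Dirichlet draws,
# so every row lies in the unit ball and all dot products are in [0, 1].
X = rng.dirichlet([2.0, 2.0, 2.0], size=200)[:, :2]
A, P = sample_rdpg(X, rng)

Y = top_d_embedding(P, d=2)
# Y and X differ only by an orthogonal transformation, so their Gram matrices agree.
gram_gap = np.abs(Y @ Y.T - X @ X.T).max()
```

In practice only `A` is observed, so the same embedding is applied to `A` in place of `P`.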

Provided the latent vectors are i.i.d. from any distribution $F$ on a suitable space
$\mathcal{X}$, we show that we can accurately recover the latent positions. Because the graph
model is invariant to orthogonal transformations of the latent vectors, note that the
distribution $F$ is identifiable only up to orthogonal transformations. Consequently,
our results show only that we estimate latent positions which can then be orthogonally
transformed to be close to the true latent vectors. As many subsequent inference
tasks are invariant to orthogonal transformations, it is not necessary to achieve
a rotationally accurate estimate of the original latent vectors.

For this paper, we investigate the inference task of vertex classification. This 
supervised or semi-supervised problem supposes that we have observed class labels 
for some subset of vertices and that we wish to classify the remaining vertices. To 
do this, we train a k -nearest-neighbor classifier on estimated latent vectors with 
observed class labels, which we then use to classify vertices with unobserved class
labels. Our result states that this classifier is universally consistent, meaning that 
regardless of the distribution for the latent vectors, the error for our classifier trained 
on the estimated vectors converges to Bayes optimal for that distribution. 

The theorems as stated can be generalized in various ways without much addi- 
tional work. For ease of notation and presentation, we chose to provide an illus- 
trative example for the kind of results that can be achieved for the specific random 
dot product model. In the discussion we point out various ways that this can be 
generalized. 

The remainder of the paper is structured as follows. Section 2 discusses previous
work related to the latent space approach and spectral properties of random graphs.
In section 3 we introduce the basic framework for random dot product graphs and
our proposed latent position estimator. In section 4 we argue that the estimator is
consistent, and in section 5 we show that the k-nearest-neighbors algorithm yields
consistent vertex classification. In section 6 we consider some immediate ways the
results presented herein can be extended and discuss some possible implications.
Finally, section 7 provides illustrative examples of applications of this work through
simulations and a graph derived from Wikipedia articles and hyperlinks.



2 Related Work



The latent space approach is introduced in Hoff et al. (2002). Generally, one posits
that the adjacency of two vertices is determined by a Bernoulli trial with parameter 
depending only on the latent positions associated with each vertex, and edges are 
independent conditioned on the latent positions of the vertices. 

If we suppose that the latent positions are i.i.d. from some distribution, then
the latent space approach is closely related to the theory of exchangeable random
graphs (Aldous 1981; Bickel and Chen 2009; Kallenberg 2005). For exchangeable
graphs, we have a (measurable) link function $g : [0, 1]^2 \mapsto [0, 1]$ and each vertex is
associated with a latent i.i.d. uniform $[0, 1]$ random variable denoted $X_i$. Conditioned
on the $\{X_i\}$, the adjacency of vertices $i$ and $j$ is determined by a Bernoulli trial with
parameter $g(X_i, X_j)$. For a treatment of exchangeable graphs and estimation using
the method of moments, see Bickel et al. (2011).

The latent space approach replaces the latent uniform random variables with
random variables in some $\mathcal{X} \subset \mathbb{R}^d$, and the link function $g$ has domain $\mathcal{X}^2$. These
random graphs still have exchangeable vertices and so could be represented in the
i.i.d. uniform framework. On the other hand, $d$-dimensional latent vectors allow for
additional structure and advance interpretation of the latent positions.

In fact, the following result provides a characterization of finite-dimensional
exchangeable graphs as random dot product graphs. First, we say $g$ is rank $d < \infty$
and positive semi-definite if $g$ can be written as $g(x, y) = \sum_{i=1}^{d} \psi_i(x)\psi_i(y)$ for some
linearly independent functions $\psi_j : [0, 1] \mapsto [-1, 1]$. Using this definition and the
inverse probability transform, one can easily show the following.

Proposition 2.1. An exchangeable random graph has rank $d < \infty$ and positive
semi-definite link function if and only if the random graph is distributed according to a
random dot product graph with i.i.d. latent vectors in $\mathbb{R}^d$.

Put another way, random dot product graphs are exactly the finite-dimensional
exchangeable random graphs, and hence, they represent a key area for exploration
when studying exchangeable random graphs.



An important example of a latent space model is the stochastic blockmodel
(Holland et al. 1983), where each latent vector can take one of only $b$ distinct values.
The latent positions can be taken to be $\mathcal{X} = [b] = \{1, \ldots, b\}$ for some positive integer
$b$, the number of blocks. Two vertices with the same latent position are said to be
members of the same block, and block membership of each vertex determines the
probabilities of adjacency. Vertices in the same block are said to be stochastically
equivalent. This model has been studied extensively, with many efforts focused on
unsupervised estimation of vertex block membership (Bickel and Chen 2009; Choi
et al. 2012; Snijders and Nowicki 1997). Note that Sussman et al. (In press) discusses
the relationship between stochastic blockmodels and random dot product graphs.
The value of the stochastic blockmodel is its strong notions of communities and
parsimonious structure; however the assumption of stochastic equivalence may be too
strong for many scenarios.
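One way to picture the connection between the two models is the following minimal sketch. It assumes a block probability matrix that is positive semi-definite, so that it factors as a Gram matrix of per-block latent vectors; the vectors in `nu`, the seed, and the number of vertices are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical two-block example: row k of nu is the latent vector shared by block k.
nu = np.array([[0.8, 0.1],
               [0.3, 0.6]])
B = nu @ nu.T                          # block-to-block adjacency probabilities

rng = np.random.default_rng(1)
blocks = rng.integers(0, 2, size=10)   # block membership of each vertex
X = nu[blocks]                         # latent positions take only b = 2 distinct values
P = X @ X.T                            # edge probabilities as dot products
```

Here `P[i, j]` depends only on the blocks of vertices `i` and `j`, which is the stochastic equivalence property described above.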

Many latent space approaches seek to generalize the stochastic blockmodel to
allow for variation within blocks. For example, the mixed membership model of
Airoldi et al. (2008) posits that a vertex could have partial membership in multiple
blocks. In Handcock et al. (2007), latent vectors are presumed to be drawn from a
mixture of multivariate normal distributions with the link function depending on
the distance between the latent vectors. They use Bayesian techniques to estimate
the latent vectors.

Our work relies on techniques developed in Rohe et al. (2011) and Sussman et al.
(In press) to estimate latent vectors. In particular, Rohe et al. (2011) prove that
the eigenvectors of the normalized Laplacian can be orthogonally transformed to
closely approximate the eigenvectors of the population Laplacian. Their results do
not use a specific model but rather rely on assumptions for the Laplacian. Sussman
et al. (In press) shows that for the directed stochastic blockmodel, the
eigenvectors/singular vectors of the adjacency matrix can be orthogonally transformed to
approximate the eigenvectors/singular vectors of the population adjacency matrix.
Fishkind et al. (2012) extends these results to the case when the number of blocks
in the stochastic blockmodel is unknown. Marchette et al. (2011) also uses techniques
closely related to those presented here to investigate the semi-supervised
vertex nomination task.

Finally, another line of work is exemplified by Oliveira (2009). This work shows
that, under the independent edge assumption, the adjacency matrix and the
normalized Laplacian concentrate around the respective population matrices in the
sense of the induced $L^2$ norm. This work uses techniques from random matrix theory.
Other work, such as Chung et al. (2004), investigates the spectra of the adjacency
and Laplacian matrices for random graphs under a different type of random graph
model.



3 Framework 

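Let $\mathcal{M}_n(A)$ and $\mathcal{M}_{n,m}(A)$ denote the set of $n \times n$ and $n \times m$ matrices with values in
$A$ for some set $A$. Additionally, for $M \in \mathcal{M}_n(\mathbb{R})$, let $\lambda_i(M)$ denote the eigenvalue of $M$
with the $i$-th largest magnitude. All vectors are column vectors.

Let $\mathcal{X}$ be a subset of the unit ball $\mathcal{B}(0, 1) \subset \mathbb{R}^d$ such that $\langle x_1, x_2 \rangle \in [0, 1]$ for all
$x_1, x_2 \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product. Let $F$ be a
probability measure on $\mathcal{X}$ and let $X, X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F$. Define
$\mathbf{X} := [X_1, X_2, \ldots, X_n]^\top : \Omega \mapsto \mathcal{M}_{n,d}(\mathbb{R})$ and
$\mathbf{P} := \mathbf{X}\mathbf{X}^\top : \Omega \mapsto \mathcal{M}_n([0, 1])$.

We assume that the (second moment) matrix $\mathbb{E}[X_1 X_1^\top] \in \mathcal{M}_d(\mathbb{R})$ is rank $d$ and
has distinct eigenvalues $\{\lambda_i(\mathbb{E}[XX^\top])\}$. In particular, we suppose there exists $\delta > 0$
such that

\[ 2\delta < \min_{i \neq j} |\lambda_i(\mathbb{E}[XX^\top]) - \lambda_j(\mathbb{E}[XX^\top])| \quad \text{and} \quad 2\delta < \lambda_d(\mathbb{E}[XX^\top]). \tag{1} \]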

Remark 3.1. The distinct eigenvalue assumption is not critical to the results that 
follow but is assumed for ease of presentation. The theorems hold in the general 
case with minor changes. 

Additionally, we assume that the dimension $d$ of the latent positions is known.

Let $\mathbf{A}$ be a random symmetric hollow matrix such that the entries $\{\mathbf{A}_{ij}\}_{i<j}$ are
independent Bernoulli random variables with $\mathbb{P}[\mathbf{A}_{ij} = 1] = \mathbf{P}_{ij}$ for all $i, j \in [n]$, $i < j$.
We will refer to $\mathbf{A}$ as the adjacency matrix that corresponds to a graph with vertex set
$\{1, \ldots, n\}$. Let $\tilde{U}_A \tilde{S}_A \tilde{U}_A^\top$ be the eigen-decomposition of $|\mathbf{A}|$ where $|\mathbf{A}| = (\mathbf{A}\mathbf{A}^\top)^{1/2}$ with
$\tilde{S}_A$ having positive decreasing diagonal entries. Let $U_A \in \mathcal{M}_{n,d}(\mathbb{R})$ be given by the
first $d$ columns of $\tilde{U}_A \in \mathcal{M}_n(\mathbb{R})$ and let $S_A \in \mathcal{M}_d(\mathbb{R})$ be given by the first $d$ rows and
columns of $\tilde{S}_A \in \mathcal{M}_n(\mathbb{R})$. Let $U_P$ and $S_P$ be defined similarly.
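The construction of the truncated eigen-decomposition can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `adjacency_spectral_embedding` is hypothetical, and the sketch relies on the fact that for a symmetric matrix, $(\mathbf{A}\mathbf{A}^\top)^{1/2}$ shares eigenvectors with $\mathbf{A}$ and has eigenvalues equal to the absolute values of those of $\mathbf{A}$.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Return U_A (n x d) and S_A (d x d) from the eigen-decomposition of |A|."""
    vals, vecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies
    # |A| = (A A^T)^{1/2} has the eigenvectors of A and eigenvalues |lambda_i(A)|,
    # so sorting by magnitude gives the decreasing diagonal of the decomposition.
    order = np.argsort(np.abs(vals))[::-1][:d]
    U_A = vecs[:, order]
    S_A = np.diag(np.abs(vals[order]))
    return U_A, S_A
```

The estimated latent positions used in the sequel are then the rows of `U_A @ np.sqrt(S_A)`, defined only up to an orthogonal transformation.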



4 Estimation of Latent Positions 

The key result of this section is the following theorem which shows that, using the 
eigen-decomposition of |A|, we can accurately estimate the true latent positions up 
to an orthogonal transformation. 

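Theorem 4.1. With probability greater than $1 - \frac{2(d^2+1)}{n^2}$, there exists an orthogonal
matrix $W \in \mathcal{M}_d(\mathbb{R})$ such that

\[ \|U_A S_A^{1/2} W - \mathbf{X}\|_F \le 2d\sqrt{\frac{3\log n}{\delta^3}}. \tag{2} \]

Let $W$ be as above and define $\hat{\mathbf{X}} = U_A S_A^{1/2} W$ with row $i$ denoted by $\hat{X}_i$. Then, for each
$i \in [n]$ and all $\gamma < 1$,

\[ \mathbb{P}[\|\hat{X}_i - X_i\|^2 > n^{-\gamma}] = O(n^{\gamma-1}\log n). \tag{3} \]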



We now proceed to prove this result. First, the following result, proved in Sussman
et al. (In press), provides a useful Frobenius bound for the difference between
$\mathbf{A}^2$ and $\mathbf{P}^2$.

Proposition 4.2 (Sussman et al. In press). For $\mathbf{A}$ and $\mathbf{P}$ as above, it holds with
probability greater than $1 - \frac{2}{n^2}$ that

\[ \|\mathbf{A}^2 - \mathbf{P}^2\|_F \le \sqrt{3n^3\log n}. \tag{4} \]

The proof of this proposition is omitted and uses the same Hoeffding bound as is
used to prove Eq. (7) below.

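Proposition 4.3. For $i \le d$, it holds with probability greater than $1 - \frac{2d^2}{n^2}$ that

\[ |\lambda_i(\mathbf{P}) - n\lambda_i(\mathbb{E}[XX^\top])| \le 2d^2\sqrt{n\log n}, \tag{5} \]

and for $i > d$, $\lambda_i(\mathbf{P}) = 0$. If Eq. (5) holds, then for $i, j \le d + 1$, $i \neq j$ and $\delta$ satisfying
Eq. (1) and $n$ sufficiently large, we have

\[ |\lambda_i(\mathbf{P}) - \lambda_j(\mathbf{P})| > \delta n. \tag{6} \]

Proof. First, $\lambda_i(\mathbf{P}) = \lambda_i(\mathbf{X}\mathbf{X}^\top) = \lambda_i(\mathbf{X}^\top\mathbf{X})$ for $i \le d$. Note each entry of $\mathbf{X}^\top\mathbf{X}$ is the
sum of $n$ independent random variables each in $[-1, 1]$: $(\mathbf{X}^\top\mathbf{X})_{ij} = \sum_{l=1}^{n} X_{li}X_{lj}$. This
means we can apply Hoeffding's inequality to each entry of $\mathbf{X}^\top\mathbf{X} - n\mathbb{E}[XX^\top]$ to obtain

\[ \mathbb{P}\big[|(\mathbf{X}^\top\mathbf{X} - n\mathbb{E}[XX^\top])_{ij}| \ge 2\sqrt{n\log n}\big] \le \frac{2}{n^2}. \tag{7} \]

Using a union bound we have that
$\mathbb{P}\big[\|\mathbf{X}^\top\mathbf{X} - n\mathbb{E}[XX^\top]\|_F \ge 2d^2\sqrt{n\log n}\big] \le \frac{2d^2}{n^2}$. Using
Weyl's inequality (Horn and Johnson 1985), we have the result.

Eq. (6) follows from Eq. (5) provided $2d^2\sqrt{n\log n} < n\delta$, which is the case for $n$
large enough. □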
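The constant on the right-hand side of Eq. (7) can be checked directly. The following worked step is a sketch that assumes the standard two-sided Hoeffding inequality for a sum $S_n$ of $n$ independent random variables, each confined to an interval of length 2 (here $[-1, 1]$); the symbols $S_n$ and $t$ are introduced only for this illustration.

```latex
\begin{aligned}
\mathbb{P}\big[\,|S_n - \mathbb{E}S_n| \ge t\,\big]
  &\le 2\exp\!\left(-\frac{2t^2}{n\,(1-(-1))^2}\right)
   = 2\exp\!\left(-\frac{t^2}{2n}\right), \\
t = 2\sqrt{n\log n}
  \;&\Longrightarrow\;
  2\exp\!\left(-\frac{4n\log n}{2n}\right) = 2\exp(-2\log n) = \frac{2}{n^2}.
\end{aligned}
```

Summing this bound over the $d^2$ entries of the $d \times d$ matrix gives the probability $2d^2/n^2$ used in the union bound.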

This next lemma shows that we can bound the difference between the eigenvectors
of $\mathbf{A}$ and $\mathbf{P}$, while our main results are for scaled versions of the eigenvectors.

Lemma 4.4. With probability greater than $1 - \frac{2(d^2+1)}{n^2}$, there exists a choice for the signs
of the columns of $U_A$ such that for each $i \le d$,



y31ogn 



Proof. This is a result of applying the Davis-Kahan Theorem (Davis and Kahan 1970;
see also Rohe et al. 2011) to $\mathbf{A}$ and $\mathbf{P}$. Propositions 4.2 and 4.3 give that the
eigenvalue gap for $\mathbf{P}^2$ is greater than $\delta^2 n^2$ and that $\|\mathbf{A}^2 - \mathbf{P}^2\|_F \le \sqrt{3n^3\log n}$ with
probability greater than $1 - \frac{2(d^2+1)}{n^2}$. Apply the Davis-Kahan theorem to each eigenvector of
$\mathbf{A}$ and $\mathbf{P}$, which are the same as the eigenvectors of $\mathbf{A}^2$ and $\mathbf{P}^2$, respectively, to get



' 3 log n 



min , ||(Ua),- - (U P ).ml|F < \/ (9) 



for each $i \le d$. The claim then follows by choosing $U_A$ so that $r_i = 1$ minimizes
Eq. (9) for each $i \le d$. □

We now have the ingredients to prove our main theorem. 



Proof of Theorem 4.1. The following argument assumes that Eqs. (8) and (6) hold,
which occurs with probability greater than $1 - \frac{2(d^2+1)}{n^2}$. By the triangle inequality, we
have

\[ \begin{aligned}
\|U_A S_A^{1/2} - U_P S_P^{1/2}\|_F &\le \|U_A S_A^{1/2} - U_A S_P^{1/2}\|_F + \|U_A S_P^{1/2} - U_P S_P^{1/2}\|_F \\
&= \|U_A (S_A^{1/2} - S_P^{1/2})\|_F + \|(U_A - U_P) S_P^{1/2}\|_F.
\end{aligned} \tag{10} \]

Note that

\[ \lambda_i^{1/2}(|\mathbf{A}|) - \lambda_i^{1/2}(\mathbf{P}) = \frac{\lambda_i^2(|\mathbf{A}|) - \lambda_i^2(\mathbf{P})}{\big(\lambda_i(|\mathbf{A}|) + \lambda_i(\mathbf{P})\big)\big(\lambda_i(|\mathbf{A}|)^{1/2} + \lambda_i(\mathbf{P})^{1/2}\big)}, \tag{11} \]

where the numerator of the right hand side is less than $\sqrt{3n^3\log n}$ by Proposition 4.2
and the denominator is greater than $(\delta n)^{3/2}$ by Proposition 4.3. The first term in
Eq. (10) is thus bounded by $d\sqrt{3\log n/\delta^3}$. For the second term, $(S_P)_{ii} \le n$ and
$\|U_A - U_P\|_F$ is controlled by Eq. (8). We have established that with probability greater
than $1 - \frac{2(d^2+1)}{n^2}$,

\[ \|U_A S_A^{1/2} - U_P S_P^{1/2}\|_F \le 2d\sqrt{\frac{3\log n}{\delta^3}}. \tag{12} \]

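We now will show that an orthogonal transformation will give us the same bound
in terms of $\mathbf{X}$. Let $\mathbf{Y} = U_P S_P^{1/2}$. Then $\mathbf{Y}\mathbf{Y}^\top = \mathbf{P} = \mathbf{X}\mathbf{X}^\top$ and thus $\mathbf{Y}\mathbf{Y}^\top\mathbf{X} = \mathbf{X}\mathbf{X}^\top\mathbf{X}$. Because
$\mathrm{rank}(\mathbf{P}) = d = \mathrm{rank}(\mathbf{X})$, we have that $\mathbf{X}^\top\mathbf{X}$ is non-singular and hence $\mathbf{X} = \mathbf{Y}\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$.
Let $W = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$. It is straightforward to verify that $\mathrm{rank}(W) = d$ and that $W^\top W = I$.
$W$ is thus an orthogonal matrix, and $\mathbf{X} = \mathbf{Y}W = U_P S_P^{1/2} W$. Eq. (2) is thus established.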
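The construction of $W$ in this step can be verified numerically. The sketch below is illustrative only: the latent positions are drawn from an assumed distribution (two coordinates of a Dirichlet draw), and the sizes and seed are arbitrary. It builds $\mathbf{Y}$ from the eigen-decomposition of $\mathbf{P}$, forms $W$ as in the proof, and measures how far $W$ is from orthogonal and how far $\mathbf{Y}W$ is from $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 2
# Assumed latent positions with rank d; any full-column-rank X would do.
X = rng.dirichlet([2.0, 2.0, 2.0], size=n)[:, :d]
P = X @ X.T

vals, vecs = np.linalg.eigh(P)
idx = np.argsort(vals)[::-1][:d]
Y = vecs[:, idx] * np.sqrt(vals[idx])        # Y = U_P S_P^{1/2}

W = Y.T @ X @ np.linalg.inv(X.T @ X)         # W as defined in the proof
orth_err = np.abs(W.T @ W - np.eye(d)).max() # distance of W^T W from the identity
recon_err = np.abs(Y @ W - X).max()          # distance of Y W from X
```

Both errors should be at the level of floating-point round-off.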

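Now, we will prove Eq. (3). Note that because the $\{X_i\}$ are i.i.d., the $\{\hat{X}_i\}$ are
exchangeable and hence identically distributed. As a result, the random
variables $\|\hat{X}_i - X_i\|$ are identically distributed. Note that for sufficiently large $n$, by
conditioning on the event in Eq. (2), we have

\[ \mathbb{E}\big[\|\hat{\mathbf{X}} - \mathbf{X}\|^2\big] \le (2d)^2\,\frac{3\log n}{\delta^3} + \frac{2(d^2+1)}{n^2}\,2n = O\!\left(\frac{d^2\log n}{\delta^3}\right) \tag{13} \]

because the worst case bound is $\|\hat{\mathbf{X}} - \mathbf{X}\|^2 \le 2n$ with probability 1. We also have that

\[ \mathbb{E}\Big[\sum_{i=1}^{n} \mathbb{I}\{\|\hat{X}_i - X_i\|^2 > n^{-\gamma}\}\, n^{-\gamma}\Big] \le \mathbb{E}\big[\|\hat{\mathbf{X}} - \mathbf{X}\|^2\big] \tag{14} \]

and because the $\|\hat{X}_i - X_i\|$ are identically distributed, the left hand side is simply
$n^{1-\gamma}\,\mathbb{P}[\|\hat{X}_i - X_i\|^2 > n^{-\gamma}]$. Dividing Eq. (13) by $n^{1-\gamma}$ gives Eq. (3). □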



5 Consistent Vertex Classification 

So far we have shown that using the eigen-decomposition of |A|, we can consistently 
estimate all latent positions simultaneously (up to an orthogonal transformation). 
One could imagine that this will lead to accurate inference for various exploitation 
tasks of interest. For example, Sussman et al. (In press) explored the use of this
embedding for unsupervised clustering of vertices in the simpler stochastic blockmodel
setting. In this section, we will explore the implications of consistent latent
position estimation in the supervised classification setting. In particular, we will 
prove that universally consistent classification using k -nearest-neighbors remains 
valid when we select the neighbors using the estimated vectors rather than the true 
but unknown latent positions. 

First, let us expand our framework. Let $\mathcal{X} \subset \mathbb{R}^d$ be as in section 3 and let $F_{X,Y}$ be
a distribution on $\mathcal{X} \times \{0, 1\}$. Let
$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1}) \overset{\text{i.i.d.}}{\sim} F_{X,Y}$ and
let $\mathbf{P} \in \mathcal{M}_{n+1}([0, 1])$ and $\mathbf{A} \in \mathcal{M}_{n+1}(\{0, 1\})$ be as in section 3. Here the $Y_i$'s are the class
labels for the vertices in the graph corresponding to the adjacency matrix $\mathbf{A}$.

We suppose that we observe only $\mathbf{A}$, the adjacency matrix, and $Y_1, \ldots, Y_n$, the class
labels for all but the last vertex. Our goal is to accurately classify this last vertex, so
for notational convenience define $X := X_{n+1}$ and $Y := Y_{n+1}$. Let the rows of $U_A S_A^{1/2}$ be
denoted by $\zeta_1, \ldots, \zeta_{n+1}$ and write $\zeta := \zeta_{n+1}$. The k-nearest-neighbor rule for $k$ odd is
defined as follows. For $1 \le i \le n$, let $W_{ni}(X) = 1/k$ only if $\zeta_i$ is one of the $k$ nearest points
to $\zeta$ from among $\{\zeta_i\}_{i=1}^{n}$; $W_{ni}(X) = 0$ otherwise. (We break ties by selecting the neighbor with
the smallest index.)
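The weights just defined, together with the majority vote that turns them into a classifier, can be sketched as follows. This is a minimal illustration, assuming the embedded points are stored as rows of an array with the unlabeled vertex last; the function names `knn_weights` and `knn_classify` are hypothetical.

```python
import numpy as np

def knn_weights(Z, k):
    """Weights W_ni for the last row of Z, using the first n rows as training points."""
    train, query = Z[:-1], Z[-1]
    dist = np.linalg.norm(train - query, axis=1)
    # A stable sort keeps the smaller index first among equidistant points,
    # which implements the tie-breaking rule stated above.
    nearest = np.argsort(dist, kind="stable")[:k]
    w = np.zeros(len(train))
    w[nearest] = 1.0 / k
    return w

def knn_classify(Z, labels, k):
    """Majority vote: predict 1 when the weighted sum of labels exceeds 1/2."""
    return int(knn_weights(Z, k) @ np.asarray(labels, dtype=float) > 0.5)
```

Because only Euclidean distances between rows are used, applying the same orthogonal transformation to every row leaves the weights unchanged.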

The k-nearest-neighbor rule is then given by $h_n(x) = \mathbb{I}\{\sum_{i=1}^{n} W_{ni}(X) Y_i > \tfrac{1}{2}\}$. It is
a well known theorem of Stone (1977) that, had we observed the original $\{X_i\}$, the
k-nearest-neighbor rule using the Euclidean distance from $\{X_i\}$ to $X$ is universally
consistent provided $k \to \infty$ and $k/n \to 0$. This means that for any distribution $F_{X,Y}$,

\[ \mathbb{E}[L_n] := \mathbb{E}\big[\mathbb{P}[h_n(X) \neq Y \mid (X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)]\big] \to \mathbb{P}[h^*(X) \neq Y] =: L^* \tag{15} \]

as $n \to \infty$, where $h_n$ is the standard k-nearest-neighbor rule trained on the $\{(X_i, Y_i)\}$
and $h^*$ is the (optimal) Bayes rule. This theorem relies on the following very general
result, also of Stone (1977); see also Devroye et al. (1996), Theorem 6.3.



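Theorem 5.1 (Stone 1977). Assume that for any distribution of $X$, the weights $W_{ni}$
satisfy the following three conditions:

(i) There exists a constant $c$ such that for every nonnegative measurable function $f$
satisfying $\mathbb{E}[f(X)] < \infty$,

\[ \mathbb{E}\Big[\sum_{i=1}^{n} W_{ni}(X) f(X_i)\Big] \le c\,\mathbb{E}[f(X)]. \tag{16} \]

(ii) For all $a > 0$,

\[ \lim_{n \to \infty} \mathbb{E}\Big[\sum_{i=1}^{n} W_{ni}(X)\,\mathbb{I}\{\|X_i - X\| > a\}\Big] = 0. \tag{17} \]

(iii)

\[ \lim_{n \to \infty} \mathbb{E}\Big[\max_{1 \le i \le n} W_{ni}(X)\Big] = 0. \tag{18} \]

Then $h_n(x) = \mathbb{I}\{\sum_{i} W_{ni}(x) Y_i > 1/2\}$ is universally consistent.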



Remark 5.2. Recall that the $\{\hat{X}_i\}$ are defined in Theorem 4.1. Because the $\{\hat{X}_i\}$ are
obtained via an orthogonal transformation of the $\{\zeta_i\}$, the nearest neighbors of
$\hat{X} := \hat{X}_{n+1}$ are the same as those of $\zeta$. As a result of this and the relationship between $X$
and $\hat{X}$, we work using the $\{\hat{X}_i\}$, even though these cannot be known without some
additional knowledge.

To prove that the k-nearest-neighbor rule for the $\{\hat{X}_i\}$ is universally consistent,
we must show that the corresponding $W_{ni}$ satisfy these conditions. The methods to
do this are adapted from the proof presented in Devroye et al. (1996). We will outline
the steps of the proof, but the details follow mutatis mutandis from the standard
proof.
First, the following Lemma is adapted from Devroye et al. (1996) by using a
triangle inequality argument.

Lemma 5.3. Suppose $k/n \to 0$. If $X \in \mathrm{supp}(F_X)$, then $\|\hat{X}_{(k)}(\hat{X}) - X\| \to 0$ almost surely,
where $\hat{X}_{(k)}(\hat{X})$ is the $k$-th nearest neighbor of $\hat{X}$ among $\{\hat{X}_i\}_{i=1}^{n}$.

Condition (iii) follows immediately from the definition of the $W_{ni}$. The remainder
of the proof follows with few changes after recognizing that the random variables
$\{(X_i, \hat{X}_i)\}$ are exchangeable. Overall, we have the following universal consistency
result.

Theorem 5.4. If $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, then the $W_{ni}(\hat{X})$ satisfy the conditions
of Theorem 5.1 and hence $\mathbb{E}\big[\mathbb{P}[h_n(\hat{X}) \neq Y \mid \mathbf{A}, \{Y_i\}_{i=1}^{n}]\big] = \mathbb{E}[L_n] \to L^*$.



6 Extensions 

The results presented thus far are for the specific problem of determining one unob- 
served class label for a vertex in a random dot product graph. In fact, the techniques 
used can be extended to somewhat more general settings without significant addi- 
tional work. 



6.1 Classification 

For example, the results in section 5 are stated in the case that we have observed
the class labels for all but one vertex. However, the universal consistency of the
k-nearest-neighbor classifier remains valid provided the number of vertices $m$ with
observed vertex class labels goes to infinity and $k/m \to 0$ as the number of vertices
$n \to \infty$. In other words, we may train the k-nearest-neighbor rule on a smaller subset of
the estimated latent vectors provided the size of that subset goes to $\infty$.

On the other hand, if we fix the number of observed class labels $m$ and the
classification rule $h_m$ and let the number of vertices tend to $\infty$, then we can show the
probability of incorrectly classifying a vertex will converge to $L_m = \mathbb{P}[h_m(X) \neq Y]$.
Additionally, our results also hold when the class labels $Y$ can take more than two
but still finitely many values.

In fact, the results in section 5 and Eq. (3) from Theorem 4.1 rely only on the fact
that the $\{X_i\}$ are i.i.d. and bounded, the $\{(X_i, \hat{X}_i)\}$ are exchangeable, and $\|\hat{\mathbf{X}} - \mathbf{X}\|_F^2$
can be bounded with high probability by an $O(\log n)$ function. The random graph
structure provided in our framework is of interest, but it is the total noise bounds
that are crucial for the universal consistency claim to hold.



6.2 Latent Position Estimation 

In section 4 we state our results for the random dot product graph model. We
can generalize our results immediately by replacing the dot product with a bi-linear
form, $g(x, y) = x^\top (I_{d'} \oplus (-I_{d''}))\, y$, where $I_d$ is the $d \times d$ identity matrix. This model
has the interpretation that similarities in the first $d'$ dimensions increase the
probability of adjacency, while similarities in the last $d''$ reduce the probability of
adjacency. All the results remain valid under this model, and in fact, arguments in
Oliveira (2009) can be used to show that the signature of the bi-linear form can also
be estimated consistently. We also recall that the assumption of distinct eigenvalues
for $\mathbb{E}[XX^\top]$ can be removed with minor changes. Particularly, Lemma 4.4 applies to
groups of eigenvalues, and subsequent results can be adapted without changing the
order of the bounds.

This work focuses on undirected graphs and this assumption is used explicitly
throughout section 4. We believe moderate modifications would lead to similar
results for directed graphs, such as in Sussman et al. (In press); however at present we
do not investigate this problem. We also note that we assume the graph has no loops
so that $\mathbf{A}$ is hollow. This assumption can be dropped, and in fact, the impact of the
diagonal is asymptotically negligible, provided each entry is bounded. Marchette
et al. (2011) suggest that augmenting the diagonal may improve latent position
estimation for finite samples.

In Rohe et al. (2011), the number of blocks in the stochastic blockmodel, which
is related to $d$ in our setting (Sussman et al. In press), is allowed to grow with $n$; our
work can also be extended to this setting. In this case, it will be the interaction
between the rate of growth of $d$ and the rate that $\delta$ vanishes that controls the bounds
in Theorem 4.1. Additionally, the consistency of k-nearest-neighbors when the
dimension grows is less well understood and results such as Stone's Theorem 5.1 do
not apply.

In addition to keeping $d$ fixed, we also assume that $d$ is known. Fishkind et al.
(2012) and Sussman et al. (In press) suggest consistent methods to estimate the
latent space dimension. The results in Oliveira (2009) can also be used to derive
thresholds for eigenvalues to estimate $d$.

Finally, Fishkind et al. (2012) and Marchette et al. (2011) also consider that the
edges may be attributed; for example, if edges represent a communication, then the
attributes could represent the topic of the communication. The attributed case can
be thought of as a set of adjacency matrices, and we can embed each separately and
concatenate the embeddings. Fishkind et al. (2012) argues that this method works
under the attributed stochastic blockmodel and similar arguments could likely be
used to extend the current work.
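The "embed each separately and concatenate" idea can be sketched as follows. This is an illustration only, assuming one symmetric adjacency matrix per attribute value and the same embedding dimension for each; the function names `embed` and `embed_attributed` are hypothetical and not taken from the cited works.

```python
import numpy as np

def embed(A, d):
    """Scaled top-d eigenvectors (by eigenvalue magnitude) of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

def embed_attributed(adjacency_by_attribute, d):
    """Embed each attribute's adjacency matrix separately, then concatenate columns."""
    return np.hstack([embed(A, d) for A in adjacency_by_attribute])
```

With $a$ attribute values, each vertex is then represented by a vector of length $a \cdot d$, which can be passed to the same nearest-neighbor rule.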



6.3 Extension to the Laplacian 



The eigen-decomposition of the graph Laplacian is also widely used for similar
inference tasks. In this section, we argue informally that our results extend to the
Laplacian. We will consider a slight modification of the standard normalized
Laplacian as defined in Rohe et al. (2011). This modification scales the Laplacian in Rohe
et al. (2011) by $n - 1$ so that the first $d$ eigenvalues of our matrix are $O(n)$ rather than
$O(1)$ for the standard normalized Laplacian.

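Let $\mathbf{L} := D^{-1/2}\mathbf{A}D^{-1/2}$ where $D$ is diagonal with $D_{ii} := \frac{1}{n-1}\sum_{j=1}^{n} \mathbf{A}_{ij}$. Additionally,
let $\mathbf{Q} := \bar{D}^{-1/2}\mathbf{P}\bar{D}^{-1/2}$ where $\bar{D}$ is diagonal with

\[ \bar{D}_{ii} := \frac{1}{n-1}\sum_{j \neq i} \mathbf{P}_{ij} = \frac{1}{n-1}\sum_{j \neq i} \langle X_i, X_j \rangle. \tag{19} \]

Finally, define $q : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ as $q(x, y) := \frac{x}{\sqrt{\langle x, y \rangle}}$,
$Z_i := q\big(X_i, \frac{1}{n-1}\sum_{j \neq i} X_j\big)$ and $\tilde{Z}_i := q(X_i, \mathbb{E}[X])$.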
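The scaled Laplacian defined above can be computed directly from an adjacency matrix. The following is a minimal sketch, assuming a symmetric matrix with no isolated vertices; the function name `scaled_laplacian` and the decision to raise an error on an isolated vertex are choices made for this illustration.

```python
import numpy as np

def scaled_laplacian(A):
    """L = D^{-1/2} A D^{-1/2} with D_ii = (1 / (n - 1)) * sum_j A_ij."""
    n = A.shape[0]
    deg = A.sum(axis=1) / (n - 1)
    if np.any(deg == 0):
        # D is not invertible when a vertex has no neighbors.
        raise ValueError("isolated vertex: D is not invertible")
    inv_sqrt = 1.0 / np.sqrt(deg)
    # Row and column scaling is equivalent to multiplying by D^{-1/2} on both sides.
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]
```

The result equals $n - 1$ times the matrix obtained by normalizing with the raw degrees, which matches the scaling described in this section.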

Because the pairwise dot products of the rows of $\bar{D}^{-1/2}\mathbf{X}$ are the same as the
entries of $\mathbf{Q}$, the scaled eigenvectors of $\mathbf{Q}$ must be an orthogonal transformation of
the $\{Z_i\}$. Further, note that for large $n$, $Z_i$ and $\tilde{Z}_i$ will be close with high probability
because $\frac{1}{n-1}\sum_{j \neq i} X_j$ converges to $\mathbb{E}[X]$ and the function $q(X_i, \cdot)$ is smooth almost surely.
Additionally, the $\{\tilde{Z}_i\}$ are i.i.d. and $q(\cdot, \mathbb{E}[X])$ is one-to-one so that the Bayes optimal error
rate is the same for the $\{\tilde{Z}_i\}$ as for the $\{X_i\}$: $L^*_X = L^*_{\tilde{Z}}$. If the further assumption that
the minimum expected degree among all vertices is greater than $\sqrt{2}\,n/\sqrt{\log n}$ holds,
then the assumptions of Theorem 2.2 in Rohe et al. (2011) are satisfied.

Let $\hat{Z}_i$ denote the $i$-th row of the matrix $U_L S_L^{1/2}$ defined analogously to section 3
and let $\tilde{\mathbf{Z}}$ be the matrix with row $i$ given by $\tilde{Z}_i$. Using the results in Rohe et al. (2011)
and similar tools to those we have used thus far, one can show that
$\min_W \|U_L S_L^{1/2} W - \tilde{\mathbf{Z}}\|^2$ can be bounded with high probability by a function in $O(\log n)$.
As discussed above, this is sufficient for k-nearest-neighbors trained on $\{(\hat{Z}_i, Y_i)\}$ to be universally
consistent. In this paper we do not investigate the comparative values of the
eigen-decompositions for the Laplacian versus the adjacency matrix, but one factor may
be the properties of the map $q$ defined above as applied to different distributions on $\mathcal{X}$.


7 Experiments 

In this section we present empirical results for a graph derived from Wikipedia links
as well as simulations for an example wherein the $\{X_i\}$ arise from a Dirichlet
distribution.



7.1 Simulations 

To demonstrate our results, we considered a problem where perfect classification is
possible. Each $X_i : \Omega \mapsto \mathbb{R}^2$ is distributed according to a Dirichlet distribution with
parameter $\alpha = [2, 2, 2]^\top$ where we keep just the first two coordinates. The class labels
are determined by the $X_i$ with $Y_i = \mathbb{I}\{X_{i1} < X_{i2}\}$ so in particular $L^* = 0$.

For each $n \in \{100, 200, \ldots, 2000\}$, we simulated 500 instances of the $\{X_i\}$ and
sampled the associated random graph. For each graph, we used our technique to
embed each vertex in two dimensions. To facilitate comparisons, we used the matrix $\mathbf{X}$
to construct the matrix $\hat{\mathbf{X}}$ via transformation by the optimal orthogonal $W$. Figure 1
illustrates our embedding for $n = 2000$ with each point corresponding to a row of $\hat{\mathbf{X}}$
with points colored according to the class labels $\{Y_i\}$. To demonstrate our results from
section 4, Figure 2 shows the average square error in the latent position estimation
per vertex.

Figure 1: An example of estimated latent positions $\{\hat{X}_i\}$ for the distribution described
in section 7.1. Each point is colored according to class labels $\{Y_i\}$. For the original
latent positions $\{X_i\}$, the two classes would be perfectly separated by the line $y = x$.
In this figure the two classes are nearly separated but have some overlap. Note also
that some estimated positions are outside the support of the original distribution.

For each graph, we used leave-one-out cross validation to evaluate the error rate
for k-nearest-neighbors for $k = 2\lfloor \sqrt{n}/4 \rfloor + 1$. We suppose that we observe all but 1
class label as in section 5. Figure 3 shows the classification error rates. The black
line shows the classification error when classifying using $\hat{\mathbf{X}}$ while the red line shows
the classification error when classifying using $\mathbf{X}$. Unsurprisingly, classifying using $\hat{\mathbf{X}}$
gives worse performance. However we still see steady improvement as the number
of vertices increases, as predicted by our universal consistency result. Indeed, this
figure suggests that the rates of convergence may be similar for both $\mathbf{X}$ and $\hat{\mathbf{X}}$.
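One replicate of a simulation of this kind can be sketched as follows. This is not the authors' experimental code: the graph size, seed, and helper names (`embed`, `procrustes_align`, `loo_knn_error`) are assumptions for illustration, and the optimal orthogonal $W$ is computed here by the standard SVD solution of the orthogonal Procrustes problem.

```python
import numpy as np

def embed(A, d):
    """Rows of U_A S_A^{1/2}: scaled top-d eigenvectors by eigenvalue magnitude."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

def procrustes_align(Z, X):
    """Apply the orthogonal W minimizing ||Z W - X||_F, obtained from an SVD."""
    U, _, Vt = np.linalg.svd(Z.T @ X)
    return Z @ (U @ Vt)

def loo_knn_error(Z, y, k):
    """Leave-one-out error of k-nearest-neighbors with majority vote (k odd)."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # a vertex is never its own neighbor
    nearest = np.argsort(D, axis=1, kind="stable")[:, :k]
    votes = y[nearest].mean(axis=1) > 0.5
    return float(np.mean(votes != y.astype(bool)))

rng = np.random.default_rng(3)
n, d = 400, 2                            # illustrative size, smaller than in the text
X = rng.dirichlet([2.0, 2.0, 2.0], size=n)[:, :d]
y = (X[:, 0] < X[:, 1]).astype(int)      # labels separable by the line y = x
upper = np.triu(rng.random((n, n)) < X @ X.T, k=1)
A = (upper | upper.T).astype(float)

X_hat = procrustes_align(embed(A, d), X)
k = 2 * int(np.floor(np.sqrt(n) / 4)) + 1
mse_per_vertex = np.sum((X_hat - X) ** 2) / n
err_hat = loo_knn_error(X_hat, y, k)     # error using estimated positions
err_true = loo_knn_error(X, y, k)        # error using true latent positions
```

Repeating this over many replicates and values of `n` would give curves of the kind summarized in Figures 2 and 3.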

7.2 Wikipedia Graph 

For this data (Ma et al. 2012, http://www.cis.jhu.edu/~zma/zmisi09.html),
each vertex in the graph corresponds to a Wikipedia page and the edges correspond
to the presence of a hyperlink between two pages (in either direction). We consider
this as an undirected graph. Every article within two hyperlinks of the article
"Algebraic Geometry" was included as a vertex in the graph. This resulted in $n = 1382$
vertices. Additionally, each document, and hence each vertex, was manually labeled
as one of the following: Category (119), Person (372), Location (270), Date (191) and
Math (430).

To investigate the implications of the results presented thus far, we performed a
pair of illustrative investigations. First, we used our technique on random induced
subgraphs and used leave-one-out cross validation to estimate error rates for each
subgraph. We used $k = 9$ and $d = 10$ and performed 100 Monte Carlo iterates of
random induced subgraphs with $n \in \{100, 200, \ldots, 1300\}$ vertices. Figure 4 shows
the mean classification error estimates using leave-one-out cross validation on each
randomly selected subgraph. Note, the chance error rate is 1 - 430/1382 = 0.689.

Figure 2: Mean square error versus number of vertices. This figure shows the mean
square error in latent position estimation per vertex, given by $\|\hat{\mathbf{X}} - \mathbf{X}\|_F^2 / n$, for the
simulation described in section 7.1. The error bars are given by the standard
deviation of the average square error over 500 Monte Carlo replicates for each $n$. On
average, the estimated latent positions converge rapidly to the true latent positions
as the number of vertices in the graph increases.

Figure 3: Leave-one-out cross validation classification error estimates using
k-nearest neighbors for the simulations described in section 7.1. The black line shows
the classification error when classifying using $\hat{\mathbf{X}}$ while the red line shows the error
rates when classifying using $\mathbf{X}$. Error bars show the standard deviation over the 500
Monte Carlo replicates. Chance classification error is 0.5; $L^* = 0$. This figure suggests
the rates of convergence may be similar for both $\mathbf{X}$ and $\hat{\mathbf{X}}$.

Figure 4: Error rate using leave-one-out cross validation for random induced
subgraphs, plotted against $n$, the number of vertices in the subgraph. Chance
classification error is ≈ 0.688, shown in blue. This illustrates the
improvement in vertex classification as the number of vertices and the number of
observed class labels increases.

Figure 5: Leave-one-out error rate plotted against the embedding dimension $d$ for
different choices of $k$ (see legend). Each line corresponds to a different choice for
the number of nearest neighbors $k$. All results are better than chance ≈ 0.688. We
see that the method is robust to changes of $k$ and $d$ near the optimal range.



We also investigated the performance of our procedure for different choices of
$d$, the embedding dimension, and $k$, the number of nearest neighbors. Because this
data has 5 classes, we use the standard k-nearest-neighbor algorithm and break ties
by choosing the first label as ordered above. Using leave-one-out cross validation,
we calculated an estimated error rate for each $d \in \{1, \ldots, 50\}$ and $k \in \{1, 5, 9, 13, 17\}$.
The results are shown in Figure 5. This figure suggests that our technique will be
robust to different choices of $k$ and $d$ within some range.



8 Conclusion



Overall, we have shown that under the random dot product graph model, we can
consistently estimate the latent positions provided they are independent and
identically distributed. We have shown further that these estimated positions are also
sufficient to consistently classify vertices. We have shown that this method works
well in simulations and can be useful in practice for classifying documents based on
their links to other documents.